Abstract

In this paper, we consider a generalized longest common subsequence problem with multiple substring
inclusive constraints. For two input sequences $X$ and $Y$ of lengths $n$ and $m$, and a set of $d$ constraints
$P = \{P_1, \ldots, P_d\}$ of total length $r$, the problem is to find a common subsequence $Z$ of $X$ and $Y$
that includes each constraint string in $P$ as a substring and whose length is maximized. A new dynamic
programming solution to this problem is presented, and the correctness of the new algorithm
is proved. The time complexity of our algorithm is $O(d 2^d nmr)$. When the number of constraint
strings is fixed, our new algorithm for the generalized longest common subsequence problem with multiple
substring inclusive constraints requires $O(nmr)$ time and space.


1 Introduction 

The longest common subsequence (LCS) problem is a classic computer science problem with applications
in bioinformatics. It is also widely applied in diverse areas such as file comparison, pattern matching and
computational biology. Given two sequences $X$ and $Y$, the longest common subsequence problem
is to find a subsequence of $X$ and $Y$ whose length is the longest among all common subsequences of the two
given sequences. It differs from the problem of finding common substrings: unlike substrings, subsequences
are not required to occupy consecutive positions within the original sequences. The most cited algorithm,
proposed by Wagner and Fischer [29], solves the LCS problem by dynamic programming
in quadratic time. Other advanced algorithms have been proposed in the past decades.
If the number of input sequences is not fixed, the problem of finding the LCS of multiple sequences has been
proved to be NP-hard [23]. Some approximate and heuristic algorithms have been proposed for these problems.

For some biological applications, constraints must be applied to the LCS problem. These
variants of the LCS problem are called constrained LCS (CLCS) problems. One recent variant,
the constrained longest common subsequence (CLCS) problem, first addressed by Tsai [27],
has received much attention. It generalizes the LCS measure by introducing a third sequence, which
makes it possible to require that the obtained CLCS has some special properties [25]. For two given input sequences $X$
and $Y$ of lengths $m$ and $n$, respectively, and a constraint sequence $P$ of length $r$, the CLCS problem is to
find a common subsequence $Z$ of $X$ and $Y$ such that $P$ is a subsequence of $Z$ and the length of $Z$ is
maximized. The most cited algorithms, proposed independently, solve the CLCS problem
in $O(mnr)$ time and space by dynamic programming. Some improved algorithms have also
been proposed. The LCS and CLCS problems on indeterminate strings were discussed in [20].
Moreover, the problem was extended to one with weighted constraints, a more general problem [24].

Recently, a new variant of the CLCS problem, the restricted LCS problem, was proposed [14], which
excludes the given constraint as a subsequence of the answer. The restricted LCS problem becomes NP-hard
when the number of constraints is not fixed. Some more general forms of the CLCS problem, the
generalized constrained longest common subsequence (GC-LCS) problems, were addressed independently by
Chen and Chao [7]. For two input sequences $X$ and $Y$ of lengths $n$ and $m$, respectively, and a constraint
string $P$ of length $r$, the GC-LCS problem is a set of four problems, which are to find the LCS of $X$ and
$Y$ including or excluding $P$ as a subsequence or substring, respectively. The four generalized constrained LCS
problems [7] are summarized in Table 1.


Table 1: The GC-LCS problems 


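| Problem    | Input             | Output                                                                      |
|------------|-------------------|-----------------------------------------------------------------------------|
| SEQ-IC-LCS | $X$, $Y$, and $P$ | The longest common subsequence of $X$ and $Y$ including $P$ as a subsequence |
| STR-IC-LCS | $X$, $Y$, and $P$ | The longest common subsequence of $X$ and $Y$ including $P$ as a substring   |
| SEQ-EC-LCS | $X$, $Y$, and $P$ | The longest common subsequence of $X$ and $Y$ excluding $P$ as a subsequence |
| STR-EC-LCS | $X$, $Y$, and $P$ | The longest common subsequence of $X$ and $Y$ excluding $P$ as a substring   |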


For the four problems in Table 1, $O(mnr)$ time algorithms were proposed in [7]. For all four variants in
Table 1, $O(r(m + n) + (m + n) \log(m + n))$ time algorithms were proposed by using finite automata [12].
Recently, a quadratic algorithm for the STR-IC-LCS problem was proposed [10], and the time complexity of
[12] was pointed out to be incorrect.

The four GC-LCS problems can be generalized further to the case of multiple constraints. In these generalized
cases, the single constraint pattern $P$ is generalized to a set of $d$ constraints $P = \{P_1, \ldots, P_d\}$
of total length $r$, as shown in Table 2.


Table 2: The Multiple-GC-LCS problems 


| Problem      | Input                                                            | Output                                                                                         |
|--------------|------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
| M-SEQ-IC-LCS | $X$, $Y$, and a set of constraints $P = \{P_1, \ldots, P_d\}$ | The longest common subsequence of $X$ and $Y$ including each constraint $P_i \in P$ as a subsequence |
| M-STR-IC-LCS | $X$, $Y$, and a set of constraints $P = \{P_1, \ldots, P_d\}$ | The longest common subsequence of $X$ and $Y$ including each constraint $P_i \in P$ as a substring   |
| M-SEQ-EC-LCS | $X$, $Y$, and a set of constraints $P = \{P_1, \ldots, P_d\}$ | The longest common subsequence of $X$ and $Y$ excluding each constraint $P_i \in P$ as a subsequence |
| M-STR-EC-LCS | $X$, $Y$, and a set of constraints $P = \{P_1, \ldots, P_d\}$ | The longest common subsequence of $X$ and $Y$ excluding each constraint $P_i \in P$ as a substring   |


The problem M-SEQ-IC-LCS has been proved to be NP-hard in [13]. The problem M-SEQ-EC-LCS has
also been proved to be NP-hard in [14, 28]. In addition, the problems M-STR-IC-LCS and M-STR-EC-LCS
were also declared to be NP-hard in [7], but without a proof. Exponential-time algorithms for solving
these two problems were also presented in [7].

We will discuss the problem M-STR-IC-LCS in this paper. The failure functions in the Knuth-Morris-Pratt
algorithm [22] for solving the string matching problem have proved very helpful for solving the
STR-IC-LCS problem. Aho and Corasick [1] found that the failure functions can be generalized
to the case of a keyword tree to speed up the exact string matching of multiple patterns. This idea is very
helpful in our dynamic programming algorithm, and it is the principal idea of our new algorithm.

The organization of the paper is as follows.

In the following four sections, we describe our dynamic programming algorithm for the M-STR-IC-LCS
problem.

In Section 2, the preliminary knowledge for presenting our algorithm for the M-STR-IC-LCS problem is
discussed. In Section 3, we give a new dynamic programming solution for the M-STR-IC-LCS problem with
time complexity $O(d 2^d nmr)$, where $n$ and $m$ are the lengths of the two given input strings, and $r$ is the
total length of the $d$ constraint strings. In Section 4, we discuss how to implement the algorithm efficiently.
Some concluding remarks are provided in Section 5.


Figure 1: Keyword Trees


2 Preliminaries 

A sequence is a string of characters over an alphabet $\Sigma$. A subsequence of a sequence $X$ is obtained by
deleting zero or more characters from $X$ (not necessarily contiguous). A substring of a sequence $X$ is a
subsequence of successive characters within $X$.

For a given sequence $X = x_1 x_2 \cdots x_n$ of length $n$, the $i$th character of $X$ is denoted as $x_i \in \Sigma$ for
any $i = 1, \ldots, n$. A substring of $X$ from position $i$ to $j$ is denoted as $X[i:j] = x_i x_{i+1} \cdots x_j$. If
$i \neq 1$ or $j \neq n$, then the substring $X[i:j]$ is called a proper substring of $X$. A substring
$X[i:j]$ is called a prefix or a suffix of $X$ if $i = 1$ or $j = n$, respectively.

For two input sequences $X = x_1 x_2 \cdots x_n$ and $Y = y_1 y_2 \cdots y_m$ of lengths $n$ and $m$, respectively, and
a set of $d$ constraints $P = \{P_1, \ldots, P_d\}$ of total length $r$, the problem M-STR-IC-LCS is to find an LCS of
$X$ and $Y$ including each constraint $P_i \in P$ as a substring.

The keyword tree (Aho-Corasick automaton) [1, 15] is the main data structure in our dynamic programming
algorithm for processing the constraint set $P$ of the M-STR-IC-LCS problem.

Definition 1 The keyword tree for set $P$ is a rooted directed tree $T$ satisfying 3 conditions: 1. each edge is
labeled with exactly one character; 2. any two edges out of the same node have distinct labels; and 3. every
string $P_i$ in $P$ maps to some node $v$ of $T$ such that the characters on the path from the root of $T$ to $v$ exactly
spell out $P_i$, and every leaf of $T$ is mapped to some string in $P$.

In order to identify the nodes of $T$, we assign the numbers $0, 1, \ldots, t-1$ to all $t$ nodes of $T$ in
preorder. Then, each node is assigned an integer $i$, $0 \le i < t$, as shown in Fig. 1. For each node
numbered $i$ of a keyword tree $T$, the concatenation of characters on the path from the root to node $i$
spells out a string denoted as $L(i)$. The string $L(i)$ is also called the label of node $i$ in the keyword
tree $T$. For example, Fig. 1 shows the keyword tree $T$ for the constraint set $P = \{aab, aba, ba\}$, where
$P_1 = aab$, $P_2 = aba$, $P_3 = ba$, and $d = 3$, $r = 8$. Clearly, every node in the keyword tree corresponds to a
prefix of one of the strings in set $P$, and every prefix of a string $P_i$ in $P$ maps to a distinct node in the
keyword tree $T$. The keyword tree for a set $P$ of total length $r$ can be easily constructed in $O(r)$
time for a constant alphabet size.
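As an illustration of the preorder numbering, the following minimal Python sketch builds a keyword tree for the example set and lists the node labels $L(i)$. The function name `build_keyword_tree` and the choice to visit children in alphabetical order are assumptions of this sketch; the alphabetical order is chosen because it is consistent with the node numbers used in the running example.

```python
def build_keyword_tree(patterns):
    """Build a keyword tree (trie) and number its nodes in preorder.

    Returns (children, label): children[v] maps a character to the child
    reached by the edge with that label, and label[v] is the string L(v)
    spelled on the path from the root to node v.
    """
    # Step 1: build a nested-dict trie holding every prefix of every pattern.
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})

    children, label = [], []

    # Step 2: number the nodes in preorder (assumption: children are
    # visited in alphabetical order of their edge labels).
    def visit(node, text):
        v = len(children)              # preorder number of this node
        children.append({})
        label.append(text)
        for ch in sorted(node):
            children[v][ch] = visit(node[ch], text + ch)
        return v

    visit(root, "")
    return children, label


children, label = build_keyword_tree(["aab", "aba", "ba"])
for v, text in enumerate(label):
    print(v, repr(text), children[v])
```

For the example set this yields $t = 8$ nodes, with nodes 3, 5 and 7 labeled aab, aba and ba.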

The keyword tree can be extended into an automaton, the Aho-Corasick automaton, which consists of three
functions: a goto function, an output function and a failure function. The goto function is represented by
the solid edges of the keyword tree, and the output function indicates when matches occur and which
strings are output. For each node $i$, its output function is denoted as $O_i$, a set of indices such that
when node $i$ is reached, the string $P_j$ is matched for each index $j \in O_i$. For example, the output
sets of nodes 3, 5 and 7 are $O_3 = \{1\}$, $O_5 = \{2, 3\}$ and $O_7 = \{3\}$, which means that the outputs of nodes 3, 5
and 7 are $\{P_1 = aab\}$, $\{P_2 = aba, P_3 = ba\}$ and $\{P_3 = ba\}$, respectively.

The failure function indicates which node to go to if no further character can be matched. It is a
generalization of the failure function in the Knuth-Morris-Pratt algorithm for solving the string matching
problem. It is represented by the dashed edges in Fig. 1.

For any node $i$ of $T$, define $lp(i)$ to be the length of the longest proper suffix of string $L(i)$ that is a prefix
of some string in $P$. It can be verified readily that for each node $i$ of $T$, if $\alpha$ is the $lp(i)$-length suffix of string
$L(i)$, then there must be a unique node $pre(i)$ in $T$ such that $L(pre(i)) = \alpha$. If $lp(i) = 0$ then $pre(i) = 0$ is
the root of $T$.

The ordered pair $(i, pre(i))$ is called a failure link. The failure link is a direct generalization of the failure
function in the KMP algorithm. For example, in Fig. 1, failure links are shown as pointers from every node
$i$ to node $pre(i)$ where $lp(i) > 0$. The other failure links point to the root and are not shown. The failure
links of $T$ thus define a failure function $pre$ for the constraint set $P$. As stated in [1], for a constant
alphabet size, the failure function $pre$ can be computed in $O(r)$ time in the worst case.

The failure list of a given node is the ordered list of the nodes located on the path to the root via
dashed edges. For example, for the nodes $i = 1, 2, 3, 4, 5, 6, 7$, the corresponding values of the failure function
are $pre(i) = 0, 1, 4, 6, 7, 0, 1$. The failure list of node 5 is $\{7 \to 1 \to 0\}$, and the failure list of node 6 is $\{0\}$,
as shown in Fig. 1.
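The failure function of the example can be reproduced with a short breadth-first computation. The sketch below is one common way to do this, not necessarily the authors' implementation; the names `CHILDREN`, `failure_function` and `failure_list` are hypothetical, and the tree is written out by hand from the example set $\{aab, aba, ba\}$ with preorder node numbers.

```python
from collections import deque

# Keyword tree for {aab, aba, ba}: CHILDREN[v] maps an edge label to a child.
CHILDREN = [{"a": 1, "b": 6}, {"a": 2, "b": 4}, {"b": 3}, {},
            {"a": 5}, {}, {"a": 7}, {}]


def failure_function(children):
    """Compute pre(v) for every node by a breadth-first traversal."""
    pre = [0] * len(children)
    queue = deque(children[0].values())   # depth-1 nodes fail to the root
    while queue:
        v = queue.popleft()
        for ch, w in children[v].items():
            # Walk v's failure links until a node with an edge ch is found.
            u = pre[v]
            while u != 0 and ch not in children[u]:
                u = pre[u]
            pre[w] = children[u].get(ch, 0)
            queue.append(w)
    return pre


def failure_list(pre, v):
    """Nodes visited when following failure links from v up to the root."""
    path = []
    while v != 0:
        v = pre[v]
        path.append(v)
    return path


PRE = failure_function(CHILDREN)
print(PRE[1:])               # pre(1), ..., pre(7)
print(failure_list(PRE, 5))  # failure list of node 5
print(failure_list(PRE, 6))  # failure list of node 6
```

Breadth-first order guarantees that $pre(v)$ is already known when the children of $v$ are processed.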

The failure function $pre$ is used to speed up the search for all occurrences in a text $Z$ of strings from $P$.
For each node $i$ of $T$ and a character $c \in \Sigma$, if no edge out of node $i$ is labeled $c$, then the failure link
of node $i$ directs the search to node $pre(i)$. This is equivalent to adding the edge $(i, pre(i))$ labeled $c$ to
node $i$. This set matching method generalizes the next function of the KMP algorithm to the Aho-Corasick-next
function as follows.

Definition 2 Given a keyword tree $T$ and its failure function, for each node $i$ of $T$ and each character $c \in \Sigma$, the
Aho-Corasick-next function $\delta(i,c)$ denotes the destination of the edge labeled $c$ out of the first node, among $i$
itself followed by the nodes in $i$'s failure list, which has an edge labeled $c$. If there exists no such node, the
function returns the root.

Table 3 shows the Aho-Corasick-next function $\delta$ corresponding to the example in Fig. 1.


Table 3: Aho-Corasick-next function

We take node 4 as an example. It can be seen from Fig. 1 that $\delta(4, a) = 5$ and $\delta(4, b) = 6$: node 4 has an
edge labeled $a$ to node 5 but no edge labeled $b$, and the first node in its failure list $\{6 \to 0\}$ with an edge
labeled $b$ is the root, whose edge leads to node 6. It is easy to
see that each element of the Aho-Corasick-next function can be computed in constant time.
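One way to obtain constant-time access to the Aho-Corasick-next function is to tabulate it once from the keyword tree and the failure function. The sketch below does this for the example set $\{aab, aba, ba\}$ over the alphabet $\{a, b\}$; the names `CHILDREN`, `PRE`, `ac_next` and `DELTA` are hypothetical, and both the tree and the failure function are written out by hand.

```python
# Keyword tree and failure function for {aab, aba, ba}, preorder numbering.
CHILDREN = [{"a": 1, "b": 6}, {"a": 2, "b": 4}, {"b": 3}, {},
            {"a": 5}, {}, {"a": 7}, {}]
PRE = [0, 0, 1, 4, 6, 7, 0, 1]


def ac_next(v, c):
    """Start at v and follow failure links until a node has an edge c."""
    while c not in CHILDREN[v]:
        if v == 0:
            return 0          # not even the root has an edge labeled c
        v = PRE[v]
    return CHILDREN[v][c]


# Tabulate once; afterwards each lookup DELTA[v][c] is a table access.
DELTA = [{c: ac_next(v, c) for c in "ab"} for v in range(len(CHILDREN))]

for v, row in enumerate(DELTA):
    print(v, row)
```

Note that the walk starts at the node itself, so a node with a direct edge labeled $c$ simply follows that edge.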

The symbol $\oplus$ is also used to denote string concatenation. For example, if $S_1 = aaa$ and $S_2 = bbb$,
then it is readily seen that $S_1 \oplus S_2 = aaabbb$.


3 Our Main Result: A Dynamic Programming Algorithm 

Let $T$ be a keyword tree for the given constraint set $P$, and $Z[1:l] = z_1 z_2 \cdots z_l$ be any common
subsequence of $X$ and $Y$. If we run the set matching of $Z$ from the root of $T$, following the
Aho-Corasick-next function $\delta$ of $T$, then the search stops at a node $i$ of $T$. All such common subsequences
of $X$ and $Y$ can be classified into a group $i$, $0 \le i < t$. These $t$ groups are still not sufficient to distinguish
the different states in our dynamic programming algorithm, since the common subsequences of $X$ and $Y$ in
the same group may contain different subsets of $P$. Therefore, we must divide each group into $2^d$ new states
by attaching $d$ flags that record which constraints have already been included. The $d$ flags can be recorded
by a $d$-bit vector $s$. If the string $P_j \in P$ is included, then bit $j$ of $s$ is set to 1, otherwise 0. There are in total
$2^d$ different such bit vectors, denoted as $s_0, s_1, \ldots, s_{2^d-1}$ as follows.

Definition 3

• Let $0 \le j < 2^d$, and $j = \sum_{i=1}^{d} b_i 2^{i-1}$ with $b_i \in \{0, 1\}$. Then the set $s_j$ is defined as $s_j = \{i \mid b_i = 1, 1 \le i \le d\}$.

• If a subset of strings $s = \{P_{k_1}, P_{k_2}, \ldots, P_{k_h}\} \subseteq P$ must be added to the set $s_j$, then the set $s_j$ becomes
$s_k$, where $k = j \vee \sum_{i=1}^{h} 2^{k_i - 1}$ and the operation $\vee$ is the bitwise or operation of two integers. In this
case we denote $s_k = s_j \cup s$.

• For a sequence $z$ with state $(\alpha, \beta)$ in a given keyword tree $T$, and a character $c \in \Sigma$, consider
the state of the sequence $\hat{z} = z \oplus c$ in $T$. From the node $\alpha$, the search for $\hat{z}$ goes to node $\hat{\alpha} = \delta(\alpha, c)$.
If $O_{\hat{\alpha}}$, the output set of the node $\hat{\alpha}$, is not empty, then the strings of $O_{\hat{\alpha}}$ must be included in the sequence
$\hat{z}$, and thus the set $s_\beta$ changes to $s_{\hat{\beta}} = s_\beta \cup O_{\hat{\alpha}}$. In this case we denote $\hat{\beta} = \gamma(\alpha, \beta, c)$. In other
words, the state of $\hat{z} = z \oplus c$ in $T$ becomes $(\delta(\alpha, c), \gamma(\alpha, \beta, c))$.

For example, in the example of Fig. 1, we have $d = 3$, and $s_1 = \{1\}$, $s_6 = \{2, 3\}$, $s_7 = \{1, 2, 3\}$,
$s_7 = s_1 \cup s_6$.

Finally we have $t 2^d$ different states in our dynamic programming algorithm. For each pair $(i, j)$, $0 \le i < t$,
$0 \le j < 2^d$, the state $(i, j)$ represents the set of common subsequences of $X$ and $Y$ in group $i$, and the subset
of $P$ contained in the subsequence is recorded by the bit vector $s_j$.
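The state update can be sketched with integer bit masks, where bit $i-1$ of the mask stands for constraint $P_i$. In the sketch below the transition table `DELTA` and the output masks `OUT` are worked out by hand for the example set $\{aab, aba, ba\}$ and should be treated as assumptions of this illustration; the helper names `mask_of`, `set_of`, `step` and `scan` are likewise hypothetical.

```python
# Aho-Corasick-next table and output masks for {aab, aba, ba} (hand-derived).
DELTA = [{"a": 1, "b": 6}, {"a": 2, "b": 4}, {"a": 2, "b": 3},
         {"a": 5, "b": 6}, {"a": 5, "b": 6}, {"a": 2, "b": 4},
         {"a": 7, "b": 6}, {"a": 2, "b": 4}]
OUT = [0, 0, 0, 0b001, 0, 0b110, 0, 0b100]   # O_3={1}, O_5={2,3}, O_7={3}


def mask_of(indices):
    """Encode a set of 1-based constraint indices as an integer j."""
    mask = 0
    for i in indices:
        mask |= 1 << (i - 1)
    return mask


def set_of(mask):
    """Decode an integer j back into the index set s_j."""
    return {i + 1 for i in range(mask.bit_length()) if mask >> i & 1}


def step(state, c):
    """State of z + c, given the state (alpha, beta) of z."""
    alpha, beta = state
    nxt = DELTA[alpha][c]
    return nxt, beta | OUT[nxt]       # bitwise or realizes the set union


def scan(z):
    """State reached by a whole sequence z, starting from the root."""
    state = (0, 0)
    for c in z:
        state = step(state, c)
    return state


print(set_of(mask_of({1}) | mask_of({2, 3})))   # s_1 united with s_6
print(scan("aaba"))                              # contains aab, aba and ba
```

Scanning aaba ends in node 5 with all three bits set, so this sequence already satisfies every constraint of the example.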

Definition 4 Let $Z(i, j, (\alpha, \beta))$ denote the set of all LCSs of $X[1:i]$ and $Y[1:j]$ with state $(\alpha, \beta)$, where
$1 \le i \le n$, $1 \le j \le m$, and $0 \le \alpha < t$, $0 \le \beta < 2^d$. The length of an LCS in $Z(i, j, (\alpha, \beta))$ is denoted as
$f(i, j, (\alpha, \beta))$.

If we can compute $f(i, j, (\alpha, \beta))$ for any $1 \le i \le n$, $1 \le j \le m$, and $0 \le \alpha < t$, $0 \le \beta < 2^d$ efficiently, then
the length of an LCS of $X$ and $Y$ including $P$ must be $\max_{0 \le i < t} f(n, m, (i, 2^d - 1))$.

By using the keyword tree data structure described in the last section, we can give a recursive formula
for computing $f(i, j, (\alpha, \beta))$ by the following theorem.

Theorem 1 For two input sequences $X = x_1 x_2 \cdots x_n$ and $Y = y_1 y_2 \cdots y_m$ of lengths $n$ and $m$, respectively,
and a set of $d$ constraints $P = \{P_1, \ldots, P_d\}$ of total length $r$, let $Z(i, j, (\alpha, \beta))$ and $f(i, j, (\alpha, \beta))$ be
defined as in Definition 4. Suppose a keyword tree $T$ for the constraint set $P$ has been built, and the $t$ nodes
of $T$ are numbered in preorder. The label of the node numbered $k$ ($0 \le k < t$) is denoted as
$L(k)$. Then, for any $1 \le i \le n$, $1 \le j \le m$, and $0 \le \alpha < t$, $0 \le \beta < 2^d$, $f(i, j, (\alpha, \beta))$ can be computed by the
following recursive formula (1):

$$
f(i, j, (\alpha, \beta)) =
\begin{cases}
\max\{f(i-1, j, (\alpha, \beta)),\ f(i, j-1, (\alpha, \beta))\} & \text{if } x_i \neq y_j, \\
\max\Big\{f(i-1, j-1, (\alpha, \beta)),\ 1 + \max\limits_{(\bar{\alpha}, \bar{\beta}) \in S(\alpha, \beta, x_i)} f(i-1, j-1, (\bar{\alpha}, \bar{\beta}))\Big\} & \text{if } x_i = y_j,
\end{cases}
\tag{1}
$$

where

$$
S(\alpha, \beta, x_i) = \{(\bar{\alpha}, \bar{\beta}) \mid 0 \le \bar{\alpha} < t,\ 0 \le \bar{\beta} < 2^d,\ \delta(\bar{\alpha}, x_i) = \alpha,\ \gamma(\bar{\alpha}, \bar{\beta}, x_i) = \beta\}.
\tag{2}
$$

The boundary conditions of this recursive formula are $f(i, 0, (0, 0)) = f(0, j, (0, 0)) = 0$ for any $0 \le i \le n$,
$0 \le j \le m$.


Proof.

For any $0 \le i \le n$, $0 \le j \le m$, and $0 \le \alpha < t$, $0 \le \beta < 2^d$, suppose $f(i, j, (\alpha, \beta)) = l$ and
$z = z_1 \cdots z_l \in Z(i, j, (\alpha, \beta))$.

First of all, we notice that for each pair $(i', j')$, $1 \le i' \le n$, $1 \le j' \le m$, such that $i' \le i$ and $j' \le j$, we
have $f(i', j', (\alpha, \beta)) \le f(i, j, (\alpha, \beta))$, since a common subsequence $z$ of $X[1:i']$ and $Y[1:j']$ with state $(\alpha, \beta)$
is also a common subsequence of $X[1:i]$ and $Y[1:j]$ with state $(\alpha, \beta)$.

(1) In the case of $x_i \neq y_j$, we have $x_i \neq z_l$ or $y_j \neq z_l$.

(1.1) If $x_i \neq z_l$, then $z = z_1 \cdots z_l$ is a common subsequence of $X[1:i-1]$ and $Y[1:j]$ with state $(\alpha, \beta)$,
and so $f(i-1, j, (\alpha, \beta)) \ge l$. On the other hand, $f(i-1, j, (\alpha, \beta)) \le f(i, j, (\alpha, \beta)) = l$. Therefore, in this
case we have $f(i, j, (\alpha, \beta)) = f(i-1, j, (\alpha, \beta))$.

(1.2) If $y_j \neq z_l$, then we can prove similarly that in this case, $f(i, j, (\alpha, \beta)) = f(i, j-1, (\alpha, \beta))$.

Combining the two subcases we conclude that in the case of $x_i \neq y_j$, we have

$$f(i, j, (\alpha, \beta)) = \max\{f(i-1, j, (\alpha, \beta)),\ f(i, j-1, (\alpha, \beta))\}.$$

(2) In the case of $x_i = y_j$, there are also two cases to be distinguished.

(2.1) If $x_i = y_j \neq z_l$, then $z = z_1 \cdots z_l$ is also a common subsequence of $X[1:i-1]$ and $Y[1:j-1]$ with
state $(\alpha, \beta)$, and so $f(i-1, j-1, (\alpha, \beta)) \ge l$. On the other hand, $f(i-1, j-1, (\alpha, \beta)) \le f(i, j, (\alpha, \beta)) = l$.
Therefore, in this case we have $f(i, j, (\alpha, \beta)) = f(i-1, j-1, (\alpha, \beta))$.

(2.2) If $x_i = y_j = z_l$, then $f(i, j, (\alpha, \beta)) = l > 0$ and $z = z_1 \cdots z_l$ is an LCS of $X[1:i]$ and $Y[1:j]$ with
state $(\alpha, \beta)$.

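Let the state of $z_1 \cdots z_{l-1}$ be $(\bar{\alpha}, \bar{\beta})$, then we have $(\bar{\alpha}, \bar{\beta}) \in S(\alpha, \beta, x_i)$, since $z_l = x_i$. It follows that
$z_1 \cdots z_{l-1}$ is a common subsequence of $X[1:i-1]$ and $Y[1:j-1]$ with state $(\bar{\alpha}, \bar{\beta})$. Therefore, we have

$$f(i-1, j-1, (\bar{\alpha}, \bar{\beta})) \ge l - 1.$$

Furthermore, we have

$$\max_{(\bar{\alpha}, \bar{\beta}) \in S(\alpha, \beta, x_i)} f(i-1, j-1, (\bar{\alpha}, \bar{\beta})) \ge l - 1.$$

In other words,

$$f(i, j, (\alpha, \beta)) \le 1 + \max_{(\bar{\alpha}, \bar{\beta}) \in S(\alpha, \beta, x_i)} f(i-1, j-1, (\bar{\alpha}, \bar{\beta})). \tag{3}$$

On the other hand, for any $(\bar{\alpha}, \bar{\beta}) \in S(\alpha, \beta, x_i)$ and $v = v_1 \cdots v_h \in Z(i-1, j-1, (\bar{\alpha}, \bar{\beta}))$, $v \oplus x_i$ is a
common subsequence of $X[1:i]$ and $Y[1:j]$ with state $(\alpha, \beta)$. Therefore, $f(i, j, (\alpha, \beta)) = l \ge 1 + h =
1 + f(i-1, j-1, (\bar{\alpha}, \bar{\beta}))$, and so we conclude that

$$f(i, j, (\alpha, \beta)) \ge 1 + \max_{(\bar{\alpha}, \bar{\beta}) \in S(\alpha, \beta, x_i)} f(i-1, j-1, (\bar{\alpha}, \bar{\beta})). \tag{4}$$

Combining (3) and (4) we have, in this case,

$$f(i, j, (\alpha, \beta)) = 1 + \max_{(\bar{\alpha}, \bar{\beta}) \in S(\alpha, \beta, x_i)} f(i-1, j-1, (\bar{\alpha}, \bar{\beta})). \tag{5}$$

Combining the two subcases in the case of $x_i = y_j$, we conclude that the recursive formula (1) is correct
for the case $x_i = y_j$.

The proof is complete. ■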


4 The Implementation of the Algorithm 

According to Theorem 1, our algorithm for computing $f(i, j, (\alpha, \beta))$ is a standard 3-dimensional dynamic
programming algorithm. By the recursive formula (1), the dynamic programming algorithm for computing
$f(i, j, (\alpha, \beta))$ can be implemented as the following Algorithm 1.

In Algorithm 1, $T$ is the keyword tree for set $P$. The root of the keyword tree is numbered 0, and
the other nodes are numbered $1, 2, \ldots, t-1$ in preorder. $\delta(\alpha, c)$ is the Aho-Corasick-next
function defined in Definition 2, which can be computed in $O(1)$ time. The function $\gamma(\alpha, \beta, c)$ is defined
in Definition 3, and can be computed in $O(d)$ time. The variable $S$ records the set of states
created so far. When a node whose output set is not empty is reached, a new state may be created. Therefore, in
Algorithm 1, the current state set $S$ is extended gradually as the for loops proceed.


Algorithm 1 M-STR-IC-LCS

Input: Strings $X = x_1 \cdots x_n$, $Y = y_1 \cdots y_m$ of lengths $n$ and $m$, respectively, and a set of $d$ constraints
$P = \{P_1, \ldots, P_d\}$ of total length $r$

Output: The length of an LCS of $X$ and $Y$ including $P$

    Build a keyword tree $T$ for $P$
    for all $i, j$, $0 \le i \le n$, $0 \le j \le m$ do
        $f(i, 0, (0, 0)) \leftarrow 0$, $f(0, j, (0, 0)) \leftarrow 0$   {boundary condition}
    end for
    $S \leftarrow \{(0, 0)\}$   {current set of states}
    for $i = 1$ to $n$ do
        for $j = 1$ to $m$ do
            for each $(\alpha, \beta) \in S$ do
                if $x_i \neq y_j$ then
                    $f(i, j, (\alpha, \beta)) \leftarrow \max\{f(i-1, j, (\alpha, \beta)), f(i, j-1, (\alpha, \beta))\}$
                else
                    $\hat{\alpha} \leftarrow \delta(\alpha, x_i)$, $\hat{\beta} \leftarrow \gamma(\alpha, \beta, x_i)$   {$s_{\hat{\beta}} = s_\beta \cup O_{\hat{\alpha}}$}
                    $f(i, j, (\hat{\alpha}, \hat{\beta})) \leftarrow \max\{f(i-1, j-1, (\hat{\alpha}, \hat{\beta})), 1 + f(i-1, j-1, (\alpha, \beta))\}$
                    $S \leftarrow S \cup \{(\hat{\alpha}, \hat{\beta})\}$
                end if
            end for
        end for
    end for
    return $\max_{0 \le i < t} f(n, m, (i, 2^d - 1))$


In the worst case, the set $S$ has size $t 2^d = O(2^d r)$, where $r$ is the total length of the constraint strings.
The body of the triple for loops can be computed in $O(d)$ time in the worst case. Therefore, the total time of Algorithm 1
is $O(d 2^d nmr)$. The space used by Algorithm 1 is $O(2^d nmr)$. When the number of constraint strings
is fixed, i.e. $d$ is a constant, our new algorithm for the M-STR-IC-LCS problem requires $O(nmr)$ time and
space.
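The recurrence of Theorem 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the hypothetical helper `build_automaton` derives the Aho-Corasick-next table by a simple (not linear-time) suffix-trimming construction, each table cell is a dictionary holding only the reachable states, the cell takes the maximum over all three neighboring cells (harmless, since $f$ is monotone in $i$ and $j$), and `None` is returned when no common subsequence contains every constraint.

```python
def build_automaton(patterns, alphabet):
    """Return (delta, out) for the keyword tree of `patterns`.

    Nodes are the distinct prefixes in lexicographic order, which matches
    preorder numbering with alphabetically ordered children. out[v] is a
    bit mask of the patterns that are suffixes of the label of node v.
    """
    prefixes = sorted({p[:k] for p in patterns for k in range(len(p) + 1)} | {""})
    index = {s: v for v, s in enumerate(prefixes)}
    delta, out = [], []
    for s in prefixes:
        row = {}
        for c in alphabet:
            w = s + c
            while w not in index:      # trim from the left until w is a prefix
                w = w[1:]
            row[c] = index[w]
        delta.append(row)
        out.append(sum(1 << k for k, p in enumerate(patterns) if s.endswith(p)))
    return delta, out


def m_str_ic_lcs_length(x, y, patterns):
    """Length of an LCS of x and y containing every pattern as a substring."""
    delta, out = build_automaton(patterns, set(x) & set(y))
    full = (1 << len(patterns)) - 1
    n, m = len(x), len(y)
    # f[i][j] maps each reachable state (alpha, beta) to the best length.
    f = [[{(0, 0): 0} for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cell = dict(f[i - 1][j])
            for state, length in f[i][j - 1].items():
                if length > cell.get(state, -1):
                    cell[state] = length
            if x[i - 1] == y[j - 1]:
                c = x[i - 1]
                for (alpha, beta), length in f[i - 1][j - 1].items():
                    nxt = delta[alpha][c]
                    state = (nxt, beta | out[nxt])
                    if length + 1 > cell.get(state, -1):
                        cell[state] = length + 1
            f[i][j] = cell
    best = [length for (_, beta), length in f[n][m].items() if beta == full]
    return max(best) if best else None


print(m_str_ic_lcs_length("axb", "axb", ["ab"]))
print(m_str_ic_lcs_length("aaba", "aaba", ["aab", "aba", "ba"]))
```

In the first call the plain LCS axb has length 3, but the only common subsequence containing ab as a substring is ab itself, so the constrained answer is shorter.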

The number of constraints is an influential factor in the time and space complexities of our new algorithm.
If a string $P_i$ in the constraint set $P$ is a proper substring of another string $P_j$ in $P$, then an LCS of $X$ and $Y$
including $P_j$ must also include $P_i$. For this reason, the constraint string $P_i$ can be removed from the constraint
set $P$ without changing the solution of the problem. Without loss of generality, we can make the following
two assumptions on the constraint set $P$.

Assumption 1 There are no duplicated strings in the constraint set P. 

Assumption 2 No string in the constraint set P is a proper substring of any other string in P. 

If Assumption 1 is violated, then there must be some duplicated strings in the constraint set $P$. In this
case, we can first sort the strings in the constraint set $P$; duplicated strings can then be removed from $P$
easily, after which Assumption 1 on the constraint set $P$ is satisfied. It is clear that the removed strings will not
change the solution of the problem.

For Assumption 2, we first notice that a string $A$ in the constraint set $P$ is a proper substring of a string
$B$ in $P$ if and only if, in the keyword tree $T$ of $P$, there is a directed path of failure links from a node $v$ on
the path from the root to the leaf node corresponding to string $B$ to the leaf node corresponding to string
$A$. For example, in Fig. 1, there is a directed path of failure links from node 5 to node 7, and thus we
know that the string $ba$ corresponding to node 7 is a proper substring of the string $aba$ corresponding to node 5.

With this fact, if Assumption 2 is violated, we can remove all proper substrings from the constraint set
$P$ as follows. We first build a keyword tree $T$ for the constraint set $P$, then mark all the leaf nodes pointed to
by a failure link in $T$ by using a depth-first traversal of $T$. All the strings corresponding to the marked leaf
nodes can then be removed from $P$. Assumption 2 is now satisfied on the new constraint set, and the keyword
tree $T$ for the new constraint set is then rebuilt. It is not difficult to do this preprocessing in $O(r)$ time. It
is clear that the removed proper substrings will not change the solution of the problem.
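The effect of the two assumptions can be illustrated with a few lines of Python. The sketch below deliberately swaps in a simpler technique than the keyword-tree marking described above: it uses direct substring tests, which take quadratic rather than $O(r)$ time but produce the same reduced constraint set. The function name `normalize_constraints` is hypothetical.

```python
def normalize_constraints(patterns):
    """Drop duplicates and strings that are proper substrings of others."""
    # Assumption 1: remove duplicates while keeping the original order.
    unique = list(dict.fromkeys(patterns))
    # Assumption 2: keep p only if it is not contained in a different string.
    return [p for p in unique
            if not any(p != q and p in q for q in unique)]


print(normalize_constraints(["aab", "aba", "ba", "aba"]))
```

For the running example, the string ba is dropped because it occurs inside aba, which reduces $d$ from 3 to 2 and therefore halves the number of bit vectors.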

If we want to compute the longest common subsequence of $X$ and $Y$ including $P$, and not just its length,
we can use a simple recursive backtracking algorithm for this purpose, given as the following Algorithm 2.

At the end of our new algorithm, we find an index $\alpha$ such that $f(n, m, (\alpha, 2^d - 1))$ gives the length of
an LCS of $X$ and $Y$ including $P$. Then, a function call $back(n, m, (\alpha, 2^d - 1))$ will produce the answer LCS
accordingly.


Algorithm 2 $back(i, j, (\alpha, \beta))$

Comments: A recursive backtracking algorithm to construct the answer LCS

    if $i = 0$ or $j = 0$ then
        return
    end if
    if $x_i = y_j$ then
        if $f(i, j, (\alpha, \beta)) = f(i-1, j-1, (\alpha, \beta))$ then
            $back(i-1, j-1, (\alpha, \beta))$
        else
            for each $(\bar{\alpha}, \bar{\beta}) \in S$ do
                if $\alpha = \delta(\bar{\alpha}, x_i)$ and $\beta = \gamma(\bar{\alpha}, \bar{\beta}, x_i)$ and $f(i, j, (\alpha, \beta)) = 1 + f(i-1, j-1, (\bar{\alpha}, \bar{\beta}))$ then
                    $back(i-1, j-1, (\bar{\alpha}, \bar{\beta}))$
                    print $x_i$
                    return   {stop after the first matching predecessor}
                end if
            end for
        end if
    else if $f(i-1, j, (\alpha, \beta)) > f(i, j-1, (\alpha, \beta))$ then
        $back(i-1, j, (\alpha, \beta))$
    else
        $back(i, j-1, (\alpha, \beta))$
    end if


Since the cost of $\delta(\alpha, x_i)$ is $O(1)$ in the worst case, the time complexity of the algorithm $back(i, j, (\alpha, \beta))$ is
$O(n + m)$.
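A compact way to recover an actual answer is to store, with each table entry, a pointer to the predecessor entry it was derived from and to follow these pointers backwards. The sketch below takes this route instead of re-deriving predecessors as Algorithm 2 does; the helper `build_automaton` is a simple non-linear-time construction of the Aho-Corasick-next table, and all names as well as the `None` return value for infeasible inputs are assumptions of this illustration.

```python
def build_automaton(patterns, alphabet):
    """Aho-Corasick-next table and output bit masks (simple construction)."""
    prefixes = sorted({p[:k] for p in patterns for k in range(len(p) + 1)} | {""})
    index = {s: v for v, s in enumerate(prefixes)}
    delta, out = [], []
    for s in prefixes:
        row = {}
        for c in alphabet:
            w = s + c
            while w not in index:      # trim from the left until w is a prefix
                w = w[1:]
            row[c] = index[w]
        delta.append(row)
        out.append(sum(1 << k for k, p in enumerate(patterns) if s.endswith(p)))
    return delta, out


def m_str_ic_lcs(x, y, patterns):
    """An LCS of x and y containing every pattern as a substring, or None."""
    delta, out = build_automaton(patterns, set(x) & set(y))
    full = (1 << len(patterns)) - 1
    n, m = len(x), len(y)
    # Each entry is (length, parent); parent is (i', j', state') or None.
    base = {(0, 0): (0, None)}
    f = [[base for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cell = {}
            for pi, pj in ((i - 1, j), (i, j - 1)):      # skip x_i or y_j
                for state, (length, _) in f[pi][pj].items():
                    if state not in cell or length > cell[state][0]:
                        cell[state] = (length, (pi, pj, state))
            if x[i - 1] == y[j - 1]:                      # append x_i
                for (alpha, beta), (length, _) in f[i - 1][j - 1].items():
                    nxt = delta[alpha][x[i - 1]]
                    state = (nxt, beta | out[nxt])
                    if state not in cell or length + 1 > cell[state][0]:
                        cell[state] = (length + 1, (i - 1, j - 1, (alpha, beta)))
            f[i][j] = cell
    finals = [s for s in f[n][m] if s[1] == full]
    if not finals:
        return None
    state = max(finals, key=lambda s: f[n][m][s][0])
    i, j, chars = n, m, []
    while f[i][j][state][1] is not None:
        pi, pj, prev = f[i][j][state][1]
        if (pi, pj) == (i - 1, j - 1):    # diagonal parents come from matches
            chars.append(x[i - 1])
        i, j, state = pi, pj, prev
    return "".join(reversed(chars))


print(m_str_ic_lcs("axb", "axb", ["ab"]))
print(m_str_ic_lcs("aaba", "aaba", ["aab", "aba", "ba"]))
```

Only a matched pair $x_i = y_j$ creates a diagonal parent in this sketch, which is why the backward walk can tell emitted characters apart from skipped ones.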

Finally we summarize our results in the following Theorem. 

Theorem 2 For two input sequences $X = x_1 x_2 \cdots x_n$ and $Y = y_1 y_2 \cdots y_m$ of lengths $n$ and $m$, respectively,
and a set of $d$ constraints $P = \{P_1, \ldots, P_d\}$ of total length $r$, Algorithms 1 and 2 solve the
M-STR-IC-LCS problem correctly in $O(d 2^d nmr)$ time and $O(2^d nmr)$ space, with preprocessing time $O(r|\Sigma|)$.
When the number of constraint strings is fixed, Algorithms 1 and 2 for the M-STR-IC-LCS problem
require $O(nmr)$ time and space.

5 Concluding Remarks 

We have suggested a new dynamic programming solution for the new generalized constrained longest common
subsequence problem M-STR-IC-LCS. The new dynamic programming algorithm requires $O(d 2^d nmr)$ time
in the worst case. When the number of constraint strings $d$ is fixed, our new algorithm for the
M-STR-IC-LCS problem requires $O(nmr)$ time and space, and thus it is a polynomial time algorithm. If
$d$ is not fixed, the time complexity $O(d 2^d nmr)$ is still exponential in its expression. It is not clear whether
there is an efficient algorithm in this case. We conjecture that our new algorithm is still polynomial even
though $d$ is not fixed. We will investigate this issue further.





References 

[1] Aho A.V., Corasick M.J., Efficient string matching: an aid to bibliographic search, Commun ACM 
18(6), 1975, pp. 333-340. 

[2] Ann H.Y., Yang C.B., Tseng C.T., Hor C.Y., A fast and simple algorithm for computing the longest 
common subsequence of run-length encoded strings. Inform Process Lett 108(11), 2008, pp.360-364. 

[3] Ann H.Y., Yang C.B., Peng Y.H., Liaw B.C., Efficient algorithms for the block edit problems, Inf 
Comput 208(3),2010, pp. 221-229. 

[4] Apostolico A., Guerra C., The longest common subsequences problem revisited, Algorithmica 2(1),1987, 
pp.315-336. 

[5] Arslan A.N., Egecioglu O., Algorithms for the constrained longest common subsequence problems, Int 
J Found Comput Sci 16(6), 2005, pp. 1099-1109. 

[6] Blum C., Blesa M.J., López-Ibáñez M., Beam search for the longest common subsequence problem, Comput
Oper Res 36(12), 2009, pp. 3178-3186.

[7] Chen Y.C., Chao K.M., On the generalized constrained longest common subsequence problems, J Comb 
Optim 21(3), 2011, pp. 383-392. 

[8] Chin F.Y.L., Santis A.D., Ferrara A.L., Ho N.L., Kim S.K., A simple algorithm for the constrained
sequence problems, Inform Process Lett 90(4), 2004, pp. 175-179.

[9] Crochemore M., Hancart C., Lecroq T., Algorithms on strings, Cambridge University Press,
Cambridge, UK, 2007.

[10] Deorowicz S., Quadratic-time algorithm for a string constrained LCS problem, Inform Process Lett 
112(11), 2012, pp. 423-426. 

[11] Deorowicz S., Obstoj J., Constrained longest common subsequence computing algorithms in practice, 
Comput Inform 29(3), 2010, pp. 427-445. 

[12] Farhana E., Ferdous J., Moosa T., Rahman M.S., Finite automata based algorithms for the generalized
constrained longest common subsequence problems. In: Proceedings of the 17th international conference
on string processing and information retrieval, SPIRE'10, Los Cabos, Mexico, 2010, pp. 243-249.

[13] Gotthilf Z., Hermelin D., Lewenstein M., Constrained LCS: hardness and approximation. In: Proceedings 
of the 19th annual symposium on combinatorial pattern matching, CPM’08, Pisa, Italy, 2008, pp. 255- 
262. 

[14] Gotthilf Z., Hermelin D., Landau G.M., Lewenstein M., Restricted LCS. In: Proceedings of the 17th 
international conference on string processing and information retrieval, SPIRE’10, Los Cabos, Mexico, 
2010, pp. 250-257. 

[15] Gusfield D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational
Biology, Cambridge University Press, Cambridge, UK, 1997.

[16] Hirschberg D.S., Algorithms for the longest common subsequence problem, J ACM 24(4), 1977, pp. 
664-675. 

[17] Hunt J.W., Szymanski T.G., A fast algorithm for computing longest common subsequences, Commun 
ACM 20(5), 1977, pp. 350-353. 

[18] Iliopoulos C.S., Rahman M.S., New efficient algorithms for the LCS and constrained LCS problems. 
Inform Process Lett 106(1), 2008, pp. 13-18. 





[19] Iliopoulos C.S., Rahman M.S., A new efficient algorithm for computing the longest common subsequence, 
Theor Comput Sci 45(2), 2009, pp. 355-371. 

[20] Iliopoulos C.S., Rahman M.S., Rytter W., Algorithms for two versions of LCS problem for indeterminate 
strings, J Comb Math Comb Comput 71, 2009, pp. 155-172. 

[21] Iliopoulos C.S., Rahman M.S., Vorcek M., Vagner L., Finite automata based algorithms on subsequences 
and supersequences of degenerate strings, J Discret Algorithm 8(2), 2010, pp. 117-130. 

[22] Knuth D.E., Morris J.H.Jr, Pratt V., Fast pattern matching in strings, SIAM J Comput 6(2), 1977, pp. 
323-350. 

[23] Maier D., The complexity of some problems on subsequences and supersequences, J ACM 25, 1978, pp. 
322-336. 

[24] Peng Y.H., Yang C.B., Huang K.S., Tseng K.T., An algorithm and applications to sequence alignment 
with weighted constraints, Int J Found Comput Sci 21(1),2010, pp. 51-59. 

[25] Shyu S.J., Tsai C.Y., Finding the longest common subsequence for multiple biological sequences by ant 
colony optimization, Comput Oper Res 36(1), 2009, pp. 73-91. 

[26] Tang C.Y., Lu C.L., Constrained multiple sequence alignment tool development and its application to 
RNase family alignment, J Bioinform Comput Biol 1, 2003, pp. 267-287. 

[27] Tsai Y.T., The constrained longest common subsequence problem. Inform Process Lett 88(4), 2003, pp. 
173-176. 

[28] Tseng C.T., Yang C.B., Ann H.Y., Efficient algorithms for the longest common subsequence problem 
with sequential substring constraints, J Complexity 29, 2013, pp. 44-52. 

[29] Wagner R., Fischer M., The string-to-string correction problem, J ACM 21(1), 1974, pp. 168-173. 

[30] Wang L., Wang X., Wu Y., Zhu D., A dynamic programming solution to a generalized LCS problem. 
Inform Process Lett 113(1), 2013, pp. 723-728. 





